Averaging Weights Leads to Wider Optima and Better Generalization
Authors: Pavel Izmailov, Dmitrii Podoprikhin, Timur Garipov, Dmitry Vetrov, Andrew Gordon Wilson
Abstract
Deep neural networks are typically trained by optimizing a loss function with an SGD variant, in conjunction with a decaying learning rate, until convergence. We show that simple averaging of multiple points along the trajectory of SGD, with a cyclical or constant learning rate, leads to better generalization than conventional training. We also show that this Stochastic Weight Averaging (SWA) procedure finds much broader optima than SGD, and approximates the recent Fast Geometric Ensembling (FGE) approach with a single model. Using SWA we achieve notable improvement in test accuracy over conventional SGD training on a range of state-of-the-art residual networks, PyramidNets, DenseNets, and Shake-Shake networks on CIFAR-10, CIFAR-100, and ImageNet. In short, SWA is extremely easy to implement, improves generalization, and has almost no computational overhead.
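To make the averaging procedure concrete, here is a minimal sketch of the running weight average at the heart of SWA. The `train_one_epoch` step, the flat numpy weight vector, and the epoch counts are hypothetical stand-ins, not the authors' implementation.

```python
# A minimal sketch of SWA's running weight average; `train_one_epoch`
# and the flat numpy weight vector are hypothetical stand-ins.
import numpy as np

def swa_train(w, train_one_epoch, total_epochs=150, swa_start=100):
    """Run SGD as usual, then average the points visited after `swa_start`."""
    w_swa, n_models = None, 0
    for epoch in range(total_epochs):
        w = train_one_epoch(w)      # one SGD pass (cyclical or constant LR)
        if epoch >= swa_start:      # collect points along the SGD trajectory
            if w_swa is None:
                w_swa = w.copy()
            else:
                # running average: w_swa <- (n * w_swa + w) / (n + 1)
                w_swa = (n_models * w_swa + w) / (n_models + 1)
            n_models += 1
    return w_swa
```

Note that the averaged point is never itself visited by SGD, so batch-normalization statistics must be recomputed for the SWA weights with one extra pass over the training data before evaluation.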
Similar References
Extended and infinite ordered weighted averaging and sum operators with numerical examples
This study discusses some variants of Ordered Weighted Averaging (OWA) operators and related information aggregation methods. In detail, we define the Extended Ordered Weighted Sum (EOWS) operator and the Extended Ordered Weighted Averaging (EOWA) operator, which are applied in scientometrics evaluation where the preference is over finitely many representative works. As...
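For readers unfamiliar with the family, the following sketch shows the classical OWA operator that these extended variants build on; unlike a plain weighted average, the weights attach to ranked positions rather than to particular arguments. This is an illustration, not code from the paper.

```python
def owa(values, weights):
    """Classical Ordered Weighted Averaging: weights apply to ranks."""
    assert abs(sum(weights) - 1.0) < 1e-9, "OWA weights must sum to 1"
    ranked = sorted(values, reverse=True)   # reorder arguments, largest first
    return sum(w * v for w, v in zip(weights, ranked))

# Special cases of the weighting vector recover familiar aggregators:
print(owa([3, 9, 5], [1, 0, 0]))            # 9     -> maximum
print(owa([3, 9, 5], [0, 0, 1]))            # 3     -> minimum
print(owa([3, 9, 5], [1/3, 1/3, 1/3]))      # ~5.67 -> arithmetic mean
```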
Using Negative Correlation Learning to Improve the Performance of Combining Neural Networks
This paper investigates the effect of diversity caused by Negative Correlation Learning (NCL) in the combination of neural classifiers and presents an efficient way to improve combining performance. Decision Templates and Averaging, as two non-trainable combining methods, and Stacked Generalization, as a trainable combiner, are investigated in our experiments. Utilizing NCL for diversifying the ba...
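For context, here is a hedged sketch of the Negative Correlation Learning penalty in the Liu-Yao form, which is assumed here as the diversity mechanism the study refers to; each ensemble member's error is reduced when it deviates from the ensemble mean. This is an illustration under that assumption, not the paper's code.

```python
def ncl_losses(preds, target, lam=0.5):
    """Per-member NCL losses for one example (Liu-Yao form, assumed here).

    preds: list of member outputs f_i. The penalty -lam * (f_i - mean)^2
    rewards members that disagree with the ensemble mean, encouraging
    negatively correlated errors across the ensemble.
    """
    mean = sum(preds) / len(preds)
    return [0.5 * (f - target) ** 2 - lam * (f - mean) ** 2 for f in preds]
```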
Orthogonal Projection Weights in Dimension Reduction based on Partial Least Squares
Dimension reduction is important in the analysis of gene expression microarray data, because the high dimensionality of the data set hurts the generalization performance of classifiers. Partial least squares based dimension reduction (PLSDR) is a frequently used method, since it is specialized for handling high-dimensional data sets and leads to satisfactory classification performance. However,...
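As an illustration of this setting, here is a minimal PLS-based dimension-reduction sketch using scikit-learn on synthetic stand-in data; the sample count, feature count, and number of components are arbitrary choices, not the paper's.

```python
import numpy as np
from sklearn.cross_decomposition import PLSRegression

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 2000))             # 50 samples, 2000 "genes"
y = (X[:, :10].sum(axis=1) > 0).astype(float)   # labels tied to 10 features

pls = PLSRegression(n_components=3).fit(X, y)   # supervised projection
X_reduced = pls.transform(X)                    # latent scores for a classifier
print(X_reduced.shape)                          # (50, 3)
```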
Bidirectional Clustering of Weights for Finding Succinct Multivariate Polynomials
We present a weight sharing method called BCW and evaluate its effectiveness by applying it to find succinct multivariate polynomials. In our framework, polynomials are obtained by learning a three-layer perceptron for data having only numerical variables. If data have both numerical and nominal variables, we consider a set of nominally conditioned multivariate polynomials. To obtain succinct p...
The Induced Minkowski Ordered Weighted Averaging Distance Operator
The Minkowski distance is a distance measure that generalizes a wide range of other distances such as the Hamming and the Euclidean distance. In this paper, we develop a generalization of the Minkowski distance by using the induced ordered weighted averaging (IOWA) operator. We will call it the induced Minkowski OWA distance (IMOWAD). Then, we are able to obtain a wider range of distance measur...
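For orientation, here is the ordinary Minkowski distance that the IMOWAD generalizes; the parameter lam recovers the special cases named above, while the induced-ordering weighting itself is not reproduced here.

```python
def minkowski(x, y, lam):
    """Ordinary Minkowski distance between two equal-length vectors."""
    return sum(abs(a - b) ** lam for a, b in zip(x, y)) ** (1 / lam)

print(minkowski([0, 0], [3, 4], 1))   # 7.0 (lam=1: Hamming/Manhattan case)
print(minkowski([0, 0], [3, 4], 2))   # 5.0 (lam=2: Euclidean case)
```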
Publication date: 2018